Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Merge retry logic #2719

Merged
merged 16 commits into from
Jun 17, 2024
Merged

Conversation

nasdf
Copy link
Member

@nasdf nasdf commented Jun 13, 2024

Relevant issue(s)

Resolves #2718
Resolves #2721

Description

This PR fixes issues with merge retry and DAG sync processes. It also moves the docQueue from the net package into the db package.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

make test

Specify the platform(s) on which this was tested:

  • MacOS

@nasdf nasdf added bug Something isn't working area/db-system Related to the core system related components of the DB labels Jun 13, 2024
@nasdf nasdf added this to the DefraDB v0.12 milestone Jun 13, 2024
@nasdf nasdf self-assigned this Jun 13, 2024
Copy link

codecov bot commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 82.50000% with 14 lines in your changes missing coverage. Please review.

Project coverage is 78.13%. Comparing base (6626441) to head (32634a1).

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2719      +/-   ##
===========================================
+ Coverage    78.00%   78.13%   +0.13%     
===========================================
  Files          310      310              
  Lines        23236    23188      -48     
===========================================
- Hits         18124    18116       -8     
+ Misses        3725     3693      -32     
+ Partials      1387     1379       -8     
Flag Coverage Δ
all-tests 78.13% <82.50%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
net/server.go 80.22% <71.43%> (+1.62%) ⬆️
internal/db/merge.go 74.12% <85.00%> (+3.90%) ⬆️
net/sync_dag.go 81.82% <81.82%> (ø)

... and 4 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6626441...32634a1. Read the comment docs.

Comment on lines 52 to 53
pushLogEmitter event.Emitter

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: If you remove the docQueue, it opens the door for transaction conflicts on the dag sync process. One of the test matrix run failed for this exact reason. I suggest you keep it in :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still there but in the db package instead. I think it makes more sense to have the blocking occur there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that if the conflict happens in exactly the wrong way, blocks will be missing from the blockstore. I wanted to remove it when I refactored the dag sync process and I found out that we have to keep it in. Alternatively we need to change the dags sync logic to deal with conflicts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right I remember now. Would a simple fix be to ignore transaction conflict errors?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetBlocks doesn't handle errors very well. Even GetBlock doesn't return the block on put conflict so we would need to retry. It would be an efficien retry because the block would be found locally on the next try. I'll let you decide what to do from here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated sync process should now handle conflicts.


var err error
// retry merge up to max txn retries
for i := 0; i < db.MaxTxnRetries(); i++ {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: If you use the merge queue, I expect that a retry will never be needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user updates a doc while a merge is in progress it could still happen.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅 very true

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user updates a doc while a merge is in progress it could still happen.

I'd be super nice / reminder to document that here.

@@ -43,28 +45,37 @@ func (db *db) handleMerges(ctx context.Context, merges events.Subscription[event
return
}
go func() {
err := db.executeMerge(ctx, merge)
// ensure only one merge per docID
queue.add(merge.DocID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: If the DocID is now part of the message, we can get rid of the getDocIDFromBlock function. The reason I hadn't included DocID before is that I wanted to guarantee that there was no mistake with it. Like the wrong DocID for the provided CID. That was probably a bad reason when I think about it.

@nasdf nasdf marked this pull request as draft June 14, 2024 02:42
@nasdf nasdf marked this pull request as ready for review June 14, 2024 18:32
@nasdf nasdf requested review from fredcarle and a team June 14, 2024 18:32
Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Too bad we are reintroducing ipld-format after trying to get rid of it. We can switch over to an ipld-prime solution in the future.

@nasdf nasdf merged commit 07e431d into sourcenetwork:develop Jun 17, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/db-system Related to the core system related components of the DB bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Partial DAG sync will never merge Merge retries can result in incorrect state
3 participants